Preliminaries

In this competition, we are trying to identify common diseases of cassava crops using data science and machine learning. Existing methods of disease detection require farmers to solicit the help of government-funded agricultural experts to visually inspect and diagnose the plants. This approach is labor-intensive, costly, and limited by the short supply of experts. An automated pipeline based on mobile-quality photos of the cassava leaves would be preferable.

This competition provides a farmer-crowdsourced dataset, labeled by experts at the National Crops Resources Research Institute (NaCRRI).

In this kernel, I will present a quick EDA.

Dependencies

import numpy as np
import pandas as pd
import seaborn as sns
import albumentations as A
import matplotlib.pyplot as plt
import os, gc, cv2, random, warnings, math, sys, json, pprint, pdb

import tensorflow as tf
from tensorflow.keras import backend as K
import tensorflow_hub as hub

from sklearn.model_selection import train_test_split

warnings.simplefilter('ignore')

Setup

Tip: Setting a seed helps reproduce results. Setting the DEBUG parameter runs the notebook on a 10% stratified sample of the data, which is enough to quickly validate the architecture.
def seed_everything(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
SEED = 16
DEBUG = False #@param {type:"boolean"}

warnings.simplefilter('ignore')
seed_everything(SEED)
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
Mounted at /content/gdrive
dataset_path = '/content/gdrive/MyDrive/1_AUSTIN CHEN/Data Scientist/Datasets/cassava-leaf-disease-classification'
os.chdir(dataset_path)
os.listdir(dataset_path)
['efficientnetb3_notop.h5',
 'label_num_to_disease_map.json',
 'sample_submission.csv',
 'train.csv',
 'cassava-leaf-disease-classification.zip',
 'test_images',
 'test_tfrecords',
 'train_images',
 'train_tfrecords',
 '.ipynb_checkpoints',
 '000_normalization.data-00000-of-00001',
 '000_normalization.index',
 'checkpoint',
 'best_model.h5',
 'submission.csv']

EDA

df = pd.read_csv(dataset_path + '/train.csv')
df.head()
image_id label
0 1000015157.jpg 0
1 1000201771.jpg 3
2 100042118.jpg 1
3 1000723321.jpg 1
4 1000812911.jpg 3

Check how many images are available in the training dataset, and whether every image ID in the training set is unique.

print(f"There are {len(df)} train images")
len(df.image_id) == len(df.image_id.unique())
There are 21397 train images
True
(df.label.value_counts(normalize=True) * 100).plot.barh(figsize = (8, 5))
<matplotlib.axes._subplots.AxesSubplot at 0x7f22d3c93b00>
df['filename'] = df['image_id'].map(lambda x : dataset_path + '/train_images/' + x)
df = df.drop(columns = ['image_id'])
df = df.sample(frac=1).reset_index(drop=True)
df.head()
label filename
0 3 /content/gdrive/MyDrive/1_AUSTIN CHEN/Data Sci...
1 3 /content/gdrive/MyDrive/1_AUSTIN CHEN/Data Sci...
2 3 /content/gdrive/MyDrive/1_AUSTIN CHEN/Data Sci...
3 3 /content/gdrive/MyDrive/1_AUSTIN CHEN/Data Sci...
4 3 /content/gdrive/MyDrive/1_AUSTIN CHEN/Data Sci...
if DEBUG:
    _, df = train_test_split(
        df,
        test_size = 0.1,
        random_state=SEED,
        shuffle=True,
        stratify=df['label'])
with open(dataset_path + '/label_num_to_disease_map.json') as file:
  id2label = json.loads(file.read())
id2label
{'0': 'Cassava Bacterial Blight (CBB)',
 '1': 'Cassava Brown Streak Disease (CBSD)',
 '2': 'Cassava Green Mottle (CGM)',
 '3': 'Cassava Mosaic Disease (CMD)',
 '4': 'Healthy'}

In this case, we have 5 labels (4 diseases and healthy):

  1. Cassava Bacterial Blight (CBB)
  2. Cassava Brown Streak Disease (CBSD)
  3. Cassava Green Mottle (CGM)
  4. Cassava Mosaic Disease (CMD)
  5. Healthy

Label 3, Cassava Mosaic Disease (CMD), is the most common label. This imbalance may have to be addressed with a weighted loss function or oversampling. I might try this in a future iteration of this kernel or in a new kernel.
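
One common way to build a weighted loss is to weight each class by its inverse frequency. The snippet below is a minimal sketch of that idea, not something this notebook currently uses; the helper name inverse_frequency_weights and the toy labels are made up for illustration.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Return {class_id: weight}, giving rarer classes proportionally larger weights."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=num_classes)
    # Avoid division by zero for classes that are absent from the sample.
    safe_counts = np.maximum(counts, 1)
    # "Balanced" heuristic: n_samples / (n_classes * class_count).
    weights = len(labels) / (num_classes * safe_counts)
    return {class_id: float(w) for class_id, w in enumerate(weights)}

# Toy labels, made up for illustration: class 3 dominates.
toy_labels = [3, 3, 3, 3, 3, 3, 0, 1, 2, 4]
class_weight = inverse_frequency_weights(toy_labels, num_classes=5)
print(class_weight)
```

A dictionary of this shape is what the class_weight argument of model.fit accepts. Whether it actually helps on this dataset is untested here.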

Let's check an example image to see what it looks like

from PIL import Image
img = Image.open(df[df.label==3]['filename'].iloc[0])
width, height = img.size
print(f"Width: {width}, Height: {height}")
Width: 800, Height: 600
img

EfficientNet

EfficientNet, first introduced in Tan and Le, 2019, is among the most efficient models (i.e. requiring the fewest FLOPS for inference) that reach state-of-the-art accuracy on both ImageNet and common image classification transfer learning tasks.

The smallest base model is similar to MnasNet, which reached near-SOTA with a significantly smaller model. By introducing a heuristic way to scale the model, EfficientNet provides a family of models (B0 to B7) that represents a good combination of efficiency and accuracy at a variety of scales. This scaling heuristic (compound scaling; for details see Tan and Le, 2019) allows the efficiency-oriented base model (B0) to surpass models at every scale, while avoiding an extensive grid search of hyperparameters.
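
To make compound scaling concrete: the paper scales depth, width and input resolution together as alpha^phi, beta^phi and gamma^phi, with alpha = 1.2, beta = 1.1 and gamma = 1.15 chosen so that alpha * beta^2 * gamma^2 is roughly 2. The sketch below just evaluates those formulas; compound_scale is a name invented here, and the released B1 to B7 variants may not follow these multipliers exactly.

```python
# Base coefficients for depth, width and resolution as reported by Tan and Le (2019).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution, FLOPS) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPS grow with depth * width^2 * resolution^2, i.e. roughly 2^phi.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPS x{f:.2f}")
```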

A summary of the latest updates on the model is available here, where various augmentation schemes and semi-supervised learning approaches are applied to further improve the ImageNet performance of the models. These extensions of the model can be used by updating weights without changing the model topology.

B0 to B7 variants of EfficientNet

(I will summarize the paper after finishing reading it)

Keras implementation of EfficientNet

An implementation of EfficientNet B0 to B7 has been shipped with tf.keras since TF2.3. To use EfficientNetB0 for classifying 1000 classes of images from imagenet, run:

from tensorflow.keras.applications import EfficientNetB0
model = EfficientNetB0(weights='imagenet')

The B0 model takes input images of shape (224, 224, 3), and the input data should be in the range [0, 255]. Normalization is included as part of the model.

Training EfficientNet on ImageNet takes a tremendous amount of resources and relies on several techniques that are not part of the model architecture itself. Hence the Keras implementation loads, by default, pre-trained weights obtained via training with AutoAugment.

From B0 to B7, the base models expect different input shapes. Here is the input resolution expected by each model:

Base model        Resolution
EfficientNetB0    224
EfficientNetB1    240
EfficientNetB2    260
EfficientNetB3    300
EfficientNetB4    380
EfficientNetB5    456
EfficientNetB6    528
EfficientNetB7    600

When the model is intended for transfer learning, the Keras implementation provides an option to remove the top layers:

model = EfficientNetB0(include_top=False, weights='imagenet')

This option excludes the final Dense layer that turns 1280 features on the penultimate layer into prediction of the 1000 ImageNet classes. Replacing the top layer with custom layers allows using EfficientNet as a feature extractor in a transfer learning workflow.

Another argument in the model constructor worth noticing is drop_connect_rate which controls the dropout rate responsible for stochastic depth. This parameter serves as a toggle for extra regularization in finetuning, but does not affect loaded weights. For example, when stronger regularization is desired, try:

model = EfficientNetB0(weights='imagenet', drop_connect_rate=0.4)

The default value for drop_connect_rate is 0.2
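
Stochastic depth randomly skips whole residual branches during training. The NumPy snippet below is a conceptual sketch of that idea only; it is not Keras' internal implementation, and the function and argument names are invented for illustration.

```python
import numpy as np

def stochastic_depth(shortcut, residual, drop_rate, rng, training=True):
    """Add a residual branch to its shortcut, randomly dropping the branch per sample."""
    if not training or drop_rate == 0.0:
        return shortcut + residual
    keep_prob = 1.0 - drop_rate
    # One keep/drop decision per sample, broadcast over height, width and channels.
    keep_mask = rng.random((residual.shape[0], 1, 1, 1)) < keep_prob
    # Rescale kept branches so the expected output matches inference mode.
    return shortcut + residual * keep_mask / keep_prob

rng = np.random.default_rng(0)
shortcut = np.ones((8, 4, 4, 2))
residual = np.full((8, 4, 4, 2), 0.5)
out = stochastic_depth(shortcut, residual, drop_rate=0.4, rng=rng)
# Each sample is either the bare shortcut (1.0) or shortcut + 0.5 / 0.6.
print(np.unique(out.round(4)))
```

A higher drop rate drops more branches per step, which is why it acts as extra regularization during fine-tuning.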

Configuration

BASE_MODEL, IMG_SIZE = ("efficientnet_b3", 300) #param ["(\"efficientnet_b4\", 380)", "(\"efficientnet_b2\", 260)"] {type:"raw", allow-input: true}
BATCH_SIZE = 32 #param {type:"integer"}
IMG_SIZE = (IMG_SIZE, IMG_SIZE)
print("Using {} with input size {}".format(BASE_MODEL, IMG_SIZE))
Using efficientnet_b3 with input size (300, 300)

Loading data

After my quick and rough EDA, let's load the PIL Image to a NumPy array, so we can move on to data augmentation.

In fastai, the data loader API defines item_tfms and batch_tfms. The item transforms perform a fairly large crop to 224 on each image, while the other standard augmentations (in aug_transforms) are applied at the batch level on the GPU. The batch size is set to 32 here.

Splitting

train_df, valid_df = train_test_split(
    df
    ,test_size = 0.2
    ,random_state = SEED
    ,shuffle = True
    ,stratify = df['label'])

Constructing Dataset

train_ds = tf.data.Dataset.from_tensor_slices(
    (train_df.filename.values,train_df.label.values))
valid_ds = tf.data.Dataset.from_tensor_slices(
    (valid_df.filename.values, valid_df.label.values))
adapt_ds = tf.data.Dataset.from_tensor_slices(
    train_df.filename.values)
for x,y in valid_ds.take(3):
  print(x, y)
tf.Tensor(b'/content/gdrive/MyDrive/1_AUSTIN CHEN/Data Scientist/Datasets/cassava-leaf-disease-classification/train_images/2484271873.jpg', shape=(), dtype=string) tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'/content/gdrive/MyDrive/1_AUSTIN CHEN/Data Scientist/Datasets/cassava-leaf-disease-classification/train_images/3704210007.jpg', shape=(), dtype=string) tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(b'/content/gdrive/MyDrive/1_AUSTIN CHEN/Data Scientist/Datasets/cassava-leaf-disease-classification/train_images/1655615998.jpg', shape=(), dtype=string) tf.Tensor(2, shape=(), dtype=int64)
AUTOTUNE = tf.data.experimental.AUTOTUNE

Important: At this point, you may have noticed that I have not used any kind of normalization or rescaling. I recently discovered that a Normalization layer is included in Keras’ pretrained EfficientNet, as mentioned here.

Item transformation

Item transformations mainly make sure that all inputs have the same size so that they can be collated into batches.

def decode_image(filename):
  img = tf.io.read_file(filename)
  img = tf.image.decode_jpeg(img, channels=3)
  return img
  
def collate_train(filename, label):
  img = decode_image(filename)
  img = tf.image.random_brightness(img, 0.3)
  img = tf.image.random_flip_left_right(img, seed=None)
  img = tf.image.random_crop(img, size=[*IMG_SIZE, 3])
  return img, label

def process_adapt(filename):
  img = decode_image(filename)
  img = tf.keras.layers.experimental.preprocessing.Rescaling(1.0 / 255)(img)
  return img

def collate_valid(filename, label):
  img = decode_image(filename)
  img = tf.image.resize(img, [*IMG_SIZE])
  return img, label
train_ds = train_ds.map(collate_train, num_parallel_calls=AUTOTUNE)
valid_ds = valid_ds.map(collate_valid, num_parallel_calls=AUTOTUNE)
adapt_ds = adapt_ds.map(process_adapt, num_parallel_calls=AUTOTUNE)
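def show_images(ds):
  _,axs = plt.subplots(4,6,figsize=(24,16))
  for ((x, y), ax) in zip(ds.take(24), axs.flatten()):
    ax.imshow(x.numpy().astype(np.uint8))
    # Labels are sparse integer ids, so show the id itself
    # (np.argmax of a scalar label is always 0)
    ax.set_title(int(y.numpy()))
    ax.axis('off')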
show_images(train_ds)
show_images(valid_ds)

Batching Dataset

Note: I was shuffling the validation set, which was a bug; the shuffle call is now commented out.
train_ds_batch = (train_ds
                  .shuffle(buffer_size=1000)
                  .batch(BATCH_SIZE)
                  .prefetch(buffer_size=AUTOTUNE))

valid_ds_batch = (valid_ds
                  #.shuffle(buffer_size=1000)
                  .batch(BATCH_SIZE*2)
                  .prefetch(buffer_size=AUTOTUNE))

adapt_ds_batch = (adapt_ds
                  .shuffle(buffer_size=1000)
                  .batch(BATCH_SIZE)
                  .prefetch(buffer_size=AUTOTUNE))

Batch augmentation

data_augmentation = tf.keras.Sequential(
    [
      tf.keras.layers.experimental.preprocessing.RandomCrop(*IMG_SIZE),
      tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
      tf.keras.layers.experimental.preprocessing.RandomRotation(0.25),
      tf.keras.layers.experimental.preprocessing.RandomZoom((-0.2, 0)),
      tf.keras.layers.experimental.preprocessing.RandomContrast((0.2,0.2))
    ]
)
x = (train_ds
     .batch(BATCH_SIZE)
     .take(1)
     .map(lambda x,y : (data_augmentation(x), y),
          num_parallel_calls=AUTOTUNE))
show_images(x.unbatch())

Building a model

I am using an EfficientNetB3, on top of which I add some output layers to predict our 5 classes. I decided to load the ImageNet pretrained weights locally to keep the internet off (part of the requirements to submit a kernel to this competition).

from tensorflow.keras.applications import EfficientNetB3
!wget https://storage.googleapis.com/keras-applications/efficientnetb3_notop.h5
--2020-12-18 07:16:06--  https://storage.googleapis.com/keras-applications/efficientnetb3_notop.h5
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.204.128, 172.217.203.128, 74.125.141.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.204.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43941136 (42M) [application/x-hdf]
Saving to: ‘efficientnetb3_notop.h5.1’

efficientnetb3_noto 100%[===================>]  41.91M  52.1MB/s    in 0.8s    

2020-12-18 07:16:07 (52.1 MB/s) - ‘efficientnetb3_notop.h5.1’ saved [43941136/43941136]

efficientnet = EfficientNetB3(
    weights = dataset_path + "/efficientnetb3_notop.h5", 
    include_top = False, 
    input_shape = (*IMG_SIZE, 3), 
    drop_connect_rate = 0.4)
def build_model(base_model, num_class):
  inputs = tf.keras.layers.Input(shape=(*IMG_SIZE, 3))
  x = data_augmentation(inputs)
  
  x = base_model(x)

  # Rebuild top
  x = tf.keras.layers.GlobalAveragePooling2D(name="avg_pool")(x)
  #x = tf.keras.layers.BatchNormalization()(x)
  x = tf.keras.layers.Dropout(0.4, name="top_dropout")(x)
  outputs = tf.keras.layers.Dense(num_class, activation="softmax", name="pred")(x)

  model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

  return model
model = build_model(base_model=efficientnet, num_class=len(id2label))
model.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_4 (InputLayer)         [(None, 300, 300, 3)]     0         
_________________________________________________________________
sequential_5 (Sequential)    (None, 300, 300, 3)       0         
_________________________________________________________________
efficientnetb3 (Functional)  (None, 10, 10, 1536)      10783535  
_________________________________________________________________
avg_pool (GlobalAveragePooli (None, 1536)              0         
_________________________________________________________________
top_dropout (Dropout)        (None, 1536)              0         
_________________________________________________________________
pred (Dense)                 (None, 5)                 7685      
=================================================================
Total params: 10,791,220
Trainable params: 10,703,917
Non-trainable params: 87,303
_________________________________________________________________

Fine tune

The 3rd layer of EfficientNet is the Normalization layer, which can be adapted to our new dataset instead of ImageNet. Be patient on this one: it takes a bit of time, as we're going through the entire training set.

%%time
model.get_layer('efficientnetb3').get_layer('normalization').adapt(adapt_ds_batch)
model.save_weights(filepath = dataset_path + "/000_normalization")
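
Conceptually, adapting a normalization layer amounts to estimating per-channel mean and variance from the data and then standardizing inputs with them. The NumPy sketch below illustrates that idea under those assumptions; it is not the Keras implementation, channel_stats and normalize are hypothetical helpers, and the random array stands in for the rescaled training images.

```python
import numpy as np

def channel_stats(batches):
    """Stream over batches of shape (N, H, W, C); return per-channel mean and variance."""
    count, total, total_sq = 0, 0.0, 0.0
    for batch in batches:
        arr = np.asarray(batch, dtype=np.float64)
        pixels = arr.reshape(-1, arr.shape[-1])
        total = total + pixels.sum(axis=0)
        total_sq = total_sq + (pixels ** 2).sum(axis=0)
        count += pixels.shape[0]
    mean = total / count
    # Var[x] = E[x^2] - E[x]^2, computed per channel.
    variance = total_sq / count - mean ** 2
    return mean, variance

def normalize(images, mean, variance):
    """Standardize images channel-wise with the given statistics."""
    return (images - mean) / np.sqrt(variance)

rng = np.random.default_rng(0)
images = rng.uniform(0.0, 1.0, size=(64, 8, 8, 3))  # stand-in for rescaled images
batches = np.split(images, 4)                       # four batches of 16 images
mean, variance = channel_stats(batches)
normalized = normalize(images, mean, variance)
print(normalized.mean(axis=(0, 1, 2)), normalized.var(axis=(0, 1, 2)))
```

Because the statistics require a full pass over the data, the cost grows with the size of the training set, which matches the long runtime noted above.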

Training

I had long wanted to try the CosineDecay schedule implemented in tf.keras, as it seemed promising and I had struggled to find the right settings (if there were any) for ReduceLROnPlateau.

EPOCHS = 8
decay_steps = int(round(len(train_df)/BATCH_SIZE)) * EPOCHS
cosine_decay = tf.keras.experimental.CosineDecay(
    initial_learning_rate=1e-4,
    decay_steps=decay_steps,
    alpha=0.3)

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath='best_model.h5',
        monitor='val_loss',
        save_best_only=True)
    ]

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.Adam(cosine_decay),
              metrics=["accuracy"])
history = model.fit(train_ds_batch,
                    epochs = EPOCHS,
                    validation_data=valid_ds_batch,
                    callbacks=callbacks)
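
For intuition, the cosine decay schedule configured above can be written out in a few lines of plain Python. This is a sketch of the formula given in the TensorFlow documentation, with a hypothetical function name and a made-up decay_steps of 1000; it is not the library code.

```python
import math

def cosine_decay_lr(step, initial_learning_rate, decay_steps, alpha=0.0):
    """Learning rate at `step` for a cosine decay that bottoms out at alpha * initial."""
    step = min(step, decay_steps)  # the rate stays flat once decay_steps is reached
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    decayed = (1.0 - alpha) * cosine + alpha
    return initial_learning_rate * decayed

for step in (0, 250, 500, 750, 1000):
    lr = cosine_decay_lr(step, initial_learning_rate=1e-4, decay_steps=1000, alpha=0.3)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With alpha=0.3, the learning rate never drops below 30% of its initial value, so the last epochs still make meaningful updates.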

Evaluating

def plot_hist(hist):
  # Use the argument rather than the global `history` object
  plt.plot(hist.history['loss'])
  plt.plot(hist.history['val_loss'])
  plt.title('Loss over epochs')
  plt.ylabel('loss')
  plt.xlabel('epoch')
  plt.legend(['train', 'valid'], loc='best')
  plt.show()
plot_hist(history)

We load the best weights that were kept from the training phase. To check how our model is performing, we run predictions over the validation set. This can help highlight any classes that are consistently misclassified.

model.load_weights('best_model.h5')

Prediction

x = train_df.sample(1).filename.values[0]
img = decode_image(x)
imgs = [tf.image.random_crop(img, size=[*IMG_SIZE, 3]) for _ in range(4)]

_,axs = plt.subplots(1,4,figsize=(16,4))
for (x, ax) in zip(imgs, axs.flatten()):
  ax.imshow(x.numpy().astype(np.uint8))
  ax.axis('off')

I apply some very basic test time augmentation to every local image extracted from the original 600-by-800 images. We know we can do some fancy augmentation with albumentations but I wanted to do that exclusively with Keras preprocessing layers to keep the cleanest pipeline possible.

tta = tf.keras.Sequential(
    [
        tf.keras.layers.experimental.preprocessing.RandomCrop(*IMG_SIZE),
        tf.keras.layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical"),
        tf.keras.layers.experimental.preprocessing.RandomZoom((-0.2, 0.2)),
        tf.keras.layers.experimental.preprocessing.RandomContrast((0.2,0.2))
    ]
)
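def predict_tta(filename, num_tta=4):
  img = decode_image(filename)
  img = tf.expand_dims(img, 0)
  preds = []
  for _ in range(num_tta):
    # Augment the original image each time; reassigning img would compound
    # the augmentations. training=True keeps the random layers active.
    aug = tta(img, training=True)
    pred = model.predict(aug)
    preds.append(pred)
  return np.array(preds).sum(0).argmax()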
pred = predict_tta(df.sample(1).filename.values[0])
print(pred)
3
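
The predict_tta loop above calls model.predict once per augmented view. A batched variant of the same averaging idea can be sketched in plain NumPy. Everything here is a stand-in for illustration: predict_with_tta, random_flip and fake_predict are hypothetical names, and fake_predict replaces the trained model.

```python
import numpy as np

def predict_with_tta(predict_fn, augment_fn, image, num_tta=4):
    """Average class probabilities over several augmented views of one image."""
    # Every view is derived from the original image, and all views go through
    # the model in a single batched call.
    views = np.stack([augment_fn(image) for _ in range(num_tta)])
    probs = predict_fn(views)  # expected shape: (num_tta, num_classes)
    return int(np.argmax(probs.mean(axis=0)))

# Hypothetical stand-ins so the sketch runs without a trained model.
rng = np.random.default_rng(0)

def random_flip(image):
    """Flip the image left-right with probability 0.5."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image

def fake_predict(batch):
    """Pretend the model always favours class 3."""
    return np.tile([0.1, 0.2, 0.1, 0.5, 0.1], (len(batch), 1))

image = rng.integers(0, 256, size=(300, 300, 3))
print(predict_with_tta(fake_predict, random_flip, image))
```

With Keras, predict_fn could be model.predict and augment_fn a call into the tta pipeline, which would cut the number of predict calls per image from num_tta to one. That combination is untested here.
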
from tqdm import tqdm
preds = []
with tqdm(total=len(valid_df)) as pbar:
  for filename in valid_df.filename:
    pbar.update()
    preds.append(predict_tta(filename, num_tta=4))
100%|██████████| 4280/4280 [1:45:44<00:00,  1.48s/it]
cm = tf.math.confusion_matrix(valid_df.label.values, np.array(preds))
plt.figure(figsize=(10, 8))
sns.heatmap(cm,
            xticklabels=id2label.values(),
            yticklabels=id2label.values(), 
            annot=True,
            fmt='g')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
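
To put numbers on which classes are consistently misclassified, per-class recall can be read off the confusion matrix: the diagonal divided by the row totals, since rows hold the true labels. The sketch below uses a made-up 3-class matrix and a hypothetical helper name.

```python
import numpy as np

def per_class_recall(confusion):
    """Recall per class for a matrix with true labels on rows, predictions on columns."""
    confusion = np.asarray(confusion, dtype=np.float64)
    row_totals = confusion.sum(axis=1)
    # Classes without any true samples get a recall of 0 instead of a division error.
    return np.divide(np.diag(confusion), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)

# Toy confusion matrix, made up for illustration.
toy_cm = [[8, 1, 1],
          [2, 6, 2],
          [0, 0, 10]]
print(per_class_recall(toy_cm))
```

The same function should work on the cm computed above after converting it with cm.numpy().
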
test_folder = dataset_path + '/test_images/'
# Use a list (not a set) so the column order is deterministic: image_id, then label
submission_df = pd.DataFrame(columns=["image_id","label"])
submission_df["image_id"] = os.listdir(test_folder)
submission_df["label"] = 0
submission_df['label'] = (submission_df['image_id']
                            .map(lambda x : predict_tta(test_folder+x)))
submission_df
image_id label
0 2216849948.jpg 4
submission_df.to_csv("submission.csv", index=False)

1% Better Everyday


todos

  • Add a cell with a checkbox parameter to select between Kaggle and Colab; the default is Kaggle.
  • Can we run predictions in batches in TensorFlow?
  • Write a blog post on how to customize metrics/losses/optimizers.
  • Learn more about the adapt function that is used to retrain the normalization layer of EfficientNetB3.
  • Read the EfficientNet paper and summarize it in one of the sections of this notebook.
  • See if I can integrate the CutMix/MixUp augmentations in the appendix into the existing notebook.
  • "Hence the Keras implementations by default loads pre-trained weights obtained via training with AutoAugment." What does this comment mean?

done

  • Try out the data_generator and the data_frame_iterator
  • Remove the normalization step in the generator, since in EfficientNet normalization is done within the model itself and the model expects input in the range [0,255]
  • Find out the intuition and the difference between item_tfm and batch_tfm

    In fastai, item_tfm defines the transforms that are done on the CPU and batch_tfm defines those done on the GPU.

  • Customize my own data generator as fastai creates their Dataloader

    No need, things are much easier than what I was originally expecting. Please refer to the Loading data section in this notebook.

  • The 3rd layer of EfficientNet is the Normalization layer, which can be tuned to our new dataset instead of ImageNet. Be patient on this one, it does take a bit of time as we're going through the entire training set.

  • Add seed_everything function

Appendix

The albumentations library is used here primarily for resizing and normalization.

def albu_transforms_train(data_resize): 
    return A.Compose([
            A.ToFloat(),
            A.Resize(data_resize, data_resize),
        ], p=1.)

# For Validation 
def albu_transforms_valid(data_resize): 
    return A.Compose([
            A.ToFloat(),
            A.Resize(data_resize, data_resize),
        ], p=1.)
def CutMix(image, label, DIM, PROBABILITY = 1.0):
    # input image - is a batch of images of size [n,dim,dim,3] not a single image of [dim,dim,3]
    # output - a batch of images with cutmix applied
    CLASSES = 5
    
    imgs = []; labs = []
    for j in range(len(image)):
        # DO CUTMIX WITH PROBABILITY DEFINED ABOVE
        P = tf.cast( tf.random.uniform([],0,1)<=PROBABILITY, tf.int32)
        
        # CHOOSE RANDOM IMAGE TO CUTMIX WITH
        k = tf.cast( tf.random.uniform([],0,len(image)),tf.int32)
        
        # CHOOSE RANDOM LOCATION
        x = tf.cast( tf.random.uniform([],0,DIM),tf.int32)
        y = tf.cast( tf.random.uniform([],0,DIM),tf.int32)
        
        b = tf.random.uniform([],0,1) # this is beta dist with alpha=1.0
        
        WIDTH = tf.cast( DIM * tf.math.sqrt(1-b),tf.int32) * P
        ya = tf.math.maximum(0,y-WIDTH//2)
        yb = tf.math.minimum(DIM,y+WIDTH//2)
        xa = tf.math.maximum(0,x-WIDTH//2)
        xb = tf.math.minimum(DIM,x+WIDTH//2)

        # MAKE CUTMIX IMAGE
        one = image[j,ya:yb,0:xa,:]
        two = image[k,ya:yb,xa:xb,:]
        three = image[j,ya:yb,xb:DIM,:]
        middle = tf.concat([one,two,three],axis=1)
        img = tf.concat([image[j,0:ya,:,:],middle,image[j,yb:DIM,:,:]],axis=0)
        imgs.append(img)
        
        # MAKE CUTMIX LABEL
        a = tf.cast(WIDTH*WIDTH/DIM/DIM,tf.float32)
        labs.append((1-a)*label[j] + a*label[k])
            
    # RESHAPE HACK SO TPU COMPILER KNOWS SHAPE OF OUTPUT TENSOR (maybe use Python typing instead?)
    image2 = tf.reshape(tf.stack(imgs),(len(image),DIM,DIM,3))
    label2 = tf.reshape(tf.stack(labs),(len(image),CLASSES))
    
    return image2,label2
def MixUp(image, label, DIM, PROBABILITY = 1.0):
    # input image - is a batch of images of size [n,dim,dim,3] not a single image of [dim,dim,3]
    # output - a batch of images with mixup applied
    CLASSES = 5
    
    imgs = []; labs = []
    for j in range(len(image)):
        # DO MIXUP WITH PROBABILITY DEFINED ABOVE
        P = tf.cast( tf.random.uniform([],0,1)<=PROBABILITY, tf.float32)
                   
        # CHOOSE RANDOM
        k = tf.cast( tf.random.uniform([],0,len(image)),tf.int32)
        a = tf.random.uniform([],0,1)*P # this is beta dist with alpha=1.0
                    
        # MAKE MIXUP IMAGE
        img1 = image[j,]
        img2 = image[k,]
        imgs.append((1-a)*img1 + a*img2)
                    
        # MAKE MIXUP LABEL
        labs.append((1-a)*label[j] + a*label[k])
            
    # RESHAPE HACK SO TPU COMPILER KNOWS SHAPE OF OUTPUT TENSOR (maybe use Python typing instead?)
    image2 = tf.reshape(tf.stack(imgs),(len(image),DIM,DIM,3))
    label2 = tf.reshape(tf.stack(labs),(len(image),CLASSES))
    return image2,label2
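
Both functions above blend labels linearly, so they expect one-hot labels of shape [n, CLASSES] rather than the integer ids used in the main notebook. The NumPy sketch below shows the conversion and the MixUp blend for a single pair; one_hot and mixup_pair are hypothetical helpers, and the two tiny images are placeholders.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

def mixup_pair(image_a, label_a, image_b, label_b, a):
    """Blend two images and their one-hot labels with mixing weight a in [0, 1]."""
    image = (1 - a) * image_a + a * image_b
    label = (1 - a) * label_a + a * label_b
    return image, label

labels = one_hot([3, 0], num_classes=5)
image_a = np.zeros((2, 2, 3), dtype=np.float32)  # placeholder "image" of zeros
image_b = np.ones((2, 2, 3), dtype=np.float32)   # placeholder "image" of ones
mixed_image, mixed_label = mixup_pair(image_a, labels[0], image_b, labels[1], a=0.25)
print(mixed_label)
```

With soft labels like these, the loss would need to be categorical_crossentropy rather than the sparse_categorical_crossentropy used in the training cell.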